Technical Report: MapReduce-based Similarity Joins
Abstract
Cloud-enabled systems have become a crucial component for efficiently processing and analyzing massive amounts of data. One of the key data processing and analysis operations is the Similarity Join, which retrieves all data pairs whose distances are smaller than a predefined threshold ε. Although multiple algorithms and implementation techniques have been proposed for Similarity Joins, very little work has studied Similarity Joins for cloud systems. This paper focuses on the study, design, and implementation of cloud-based Similarity Joins. We present MRSimJoin, a MapReduce-based algorithm that efficiently solves the Similarity Join problem. The algorithm partitions and distributes the data until the subsets are small enough to be processed on a single node. It is general enough to be used with data that lies in any metric space and with multiple data types, e.g., numerical data, vector data, and text. We present multiple guidelines for implementing the algorithm in Hadoop, a widely used open-source cloud system. An extensive experimental evaluation of the implemented operation shows that it has very good execution time and scalability properties.
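The abstract does not spell out how the data is partitioned, so the following is only a minimal single-process sketch of the general idea, not the authors' MRSimJoin implementation. It assumes a pivot-based scheme: each record goes to the partition of its closest pivot, partitions that are still too large are split again, and pairs that straddle two partitions are recovered from a "window" of records lying within 2ε of the boundary between the two pivots (a bound that follows from the triangle inequality). The names `sim_join`, `brute_force_join`, `max_size`, and `num_pivots` are hypothetical, and nothing here uses Hadoop.

```python
import itertools
import math
import random


def brute_force_join(items, dist, eps):
    """Return all id pairs whose values are closer than eps.

    items is a list of (record_id, value) tuples.
    """
    return {
        (min(a_id, b_id), max(a_id, b_id))
        for (a_id, a), (b_id, b) in itertools.combinations(items, 2)
        if dist(a, b) < eps
    }


def sim_join(items, dist, eps, max_size=4, num_pivots=3, rng=None):
    """Similarity join by repeated pivot-based partitioning (assumed scheme)."""
    rng = rng or random.Random(0)
    if len(items) <= max_size:
        # Small enough to be handled "on a single node".
        return brute_force_join(items, dist, eps)

    pivots = [v for _, v in rng.sample(items, min(num_pivots, len(items)))]
    parts = [[] for _ in pivots]
    for rec_id, value in items:
        d = [dist(value, p) for p in pivots]
        home = min(range(len(pivots)), key=d.__getitem__)  # closest pivot
        parts[home].append((rec_id, value, d))

    if max(len(p) for p in parts) == len(items):
        # Partitioning made no progress (e.g. all values identical).
        return brute_force_join(items, dist, eps)

    result = set()
    # Pairs inside one partition: split that partition further.
    for part in parts:
        sub = [(r, v) for r, v, _ in part]
        result |= sim_join(sub, dist, eps, max_size, num_pivots, rng)

    # Pairs across two partitions: only records within 2*eps of the
    # boundary between the two pivots can have a match on the other side.
    for i, j in itertools.combinations(range(len(pivots)), 2):
        left = [(r, v) for r, v, d in parts[i] if d[j] - d[i] <= 2 * eps]
        right = [(r, v) for r, v, d in parts[j] if d[i] - d[j] <= 2 * eps]
        for (a_id, a), (b_id, b) in itertools.product(left, right):
            if dist(a, b) < eps:
                result.add((min(a_id, b_id), max(a_id, b_id)))
    return result


points = [(0.0, 0.0), (0.05, 0.0), (1.0, 1.0), (1.0, 1.02), (5.0, 5.0)]
print(sorted(sim_join(list(enumerate(points)), math.dist, eps=0.1)))
# prints [(0, 1), (2, 3)]
```

Because the sketch only relies on the triangle inequality, any metric can be passed as `dist` (Euclidean distance here, but an edit distance over strings would work the same way). In a MapReduce setting the partition assignment would presumably happen in the map phase and the per-partition joins in the reduce phase; that mapping is an assumption, not something the abstract states.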
Similar resources
SIGMOD RWE Review: "Efficient Parallel Set-Similarity Joins Using MapReduce"
This document is a review report on the paper "Efficient Parallel Set-Similarity Joins Using MapReduce" by R. Vernica, M. Carey, and C. Li, prepared by SIGMOD's 2010 Repeatability and Workability Evaluation Committee. In this section the provided resources (code, data sets, setup information) and hardware setups of the authors and reviewers are discussed. Detailed information on all experiments that the revi...
Comparing MapReduce-Based k-NN Similarity Joins on Hadoop for High-Dimensional Data
Similarity joins represent a useful operator for data mining, data analysis and data exploration applications. With the exponential growth of data to be analyzed, distributed approaches like MapReduce are required. So far, the state-of-the-art similarity join approaches based on MapReduce mainly focused on the processing of low-dimensional vector data. In this paper, we revisit and investigate ...
Efficient and Scalable Graph Similarity Joins in MapReduce
Along with the emergence of massive graph-modeled data, it is of great importance to investigate graph similarity joins due to their wide applications for multiple purposes, including data cleaning and near-duplicate detection. This paper considers graph similarity joins with edit distance constraints, which return pairs of graphs such that their edit distances are no larger than a given thres...
MapReduce Based Personalized Locality Sensitive Hashing for Similarity Joins on Large Scale Data
Locality Sensitive Hashing (LSH) has been proposed as an efficient technique for similarity joins for high dimensional data. The efficiency and approximation rate of LSH depend on the number of generated false positive instances and false negative instances. In many domains, reducing the number of false positives is crucial. Furthermore, in some application scenarios, balancing false positives ...
Heads-Join: Efficient Earth Mover's Distance Similarity Joins on Hadoop
The Earth Mover’s Distance (EMD) similarity join has a number of important applications such as near-duplicate image retrieval and distributed based pattern analysis. However, the computational cost of EMD is super-cubic and consequently the EMD similarity join operation is prohibitive for datasets of even medium size. We propose to employ the Hadoop platform to speed up the operation. Simply p...
Publication date: 2012